Abstract

Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance
in unsupervised dimensionality reduction. It can learn compact representations on standard
data sets, e.g. MNIST and RCV1. However, as a bootstrap method, MBN has a high prediction
complexity. In this paper, we propose an unsupervised model compression framework for this
general problem of unsupervised bootstrap methods. The framework compresses a large
unsupervised bootstrap model into a small model by treating the bootstrap model and its
application together as a black box and training a supervised learner to approximate the
mapping from the input of the bootstrap model to the output of the application. To
specialize the framework, we propose a new technique, named compressive MBN. It takes MBN
as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised
learner. Our initial results on MNIST show that compressive MBN not only maintains the
high prediction accuracy of MBN but is also thousands of times faster than MBN at the
prediction stage. These results suggest that the new technique combines the effectiveness
of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised
learning, yielding an unsupervised learner that is both effective and efficient.
1. Introduction

Dimensionality reduction is a core problem of machine learning; classification and
clustering can be regarded as special cases that reduce high-dimensional data to discrete
points. In this paper, we focus on unsupervised learning. Traditionally, dimensionality
reduction methods are categorized into kernel methods, neural networks, probabilistic
models, and sparse coding. Kernel methods are too costly on large-scale problems. Neural
networks scale to large data, but they double the computational complexity through a
bottleneck structure that takes the input as its own output target at the training stage,
which makes training slow. Moreover, they learn the data distribution globally, which is
not very effective for learning local structures.

Multilayer bootstrap network (MBN) is a recently proposed bootstrap method (or
unsupervised ensemble method). It has multiple nonlinear layers. Each layer is an ensemble
of k-centers clusterings. The centers of each k-centers clustering are simply data points
randomly sampled from the input (called a bootstrap sample). MBN is easy to implement and
train and, like neural networks, scales well to large-scale problems at the training stage.
Moreover, MBN learns the data distribution locally, so it can learn effective
representations of data easily. However, MBN contains hundreds of clusterings, which makes
it expensive to use for prediction.
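
To make the layer structure concrete, the following is a minimal sketch of a single
layer of this kind. It covers only what is described above (an ensemble of clusterings
whose centers are randomly sampled data points). The use of Euclidean distance, the
one-hot output code, and the function names fit_layer and transform_layer are assumptions
made for illustration; random feature selection and random reconstruction are omitted.

```python
import numpy as np

def fit_layer(x, k, n_clusterings, rng):
    """Each clustering keeps k data points, sampled without replacement, as its centers."""
    n = x.shape[0]
    return [x[rng.choice(n, size=k, replace=False)] for _ in range(n_clusterings)]

def transform_layer(x, centers_list):
    """Concatenate the one-hot code of the nearest center in every clustering."""
    codes = []
    for centers in centers_list:
        # Squared Euclidean distance from every point to every center (assumed metric).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        one_hot = np.zeros((x.shape[0], centers.shape[0]))
        one_hot[np.arange(x.shape[0]), nearest] = 1.0
        codes.append(one_hot)
    return np.hstack(codes)

rng = np.random.default_rng(0)
x = rng.random((200, 8))                       # toy data, not MNIST
layer = fit_layer(x, k=20, n_clusterings=5, rng=rng)
h = transform_layer(x, layer)
print(h.shape)  # (200, 100)
```

Every input point must be compared against all centers of all clusterings, which is where
the prediction cost of a large ensemble comes from.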

Motivated by the aforementioned problem and by recent progress on compressing ensemble
classifiers into a single small classifier in supervised learning (Buciluǎ et al. (2006);
Hinton et al. (2015)), in this paper we propose an unsupervised model compression
framework. The framework uses a supervised model to approximate the mapping function
from the input of an unsupervised bootstrap method to the output of the application of
the unsupervised bootstrap method. We further specify the framework by taking MBN
as the unsupervised bootstrap method and DNN as the supervised model. The proposed
method is named compressive MBN. To the best of our knowledge, this is the first work on
model compression for bootstrap methods in unsupervised learning.

2. Methods 

Compressive MBN consists of three steps:

• The first step trains MBN on a given training set, and outputs the low-dimensional
representation of the training data points.

• [A step driven by applications] The second step applies the low-dimensional
representation to a given application in unsupervised learning, and outputs the prediction
result of the training data.

• The third step trains a DNN with the training set as the input and the prediction
result as the target. Finally, the DNN model is used for prediction.
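
The three steps above can be sketched as a single function. This is an illustrative
outline under stated assumptions, not the implementation used in the experiments: the
names compress, toy_teacher, identity_application and fit_linear_student are hypothetical,
a PCA projection stands in for MBN, and a least-squares affine map stands in for the DNN.

```python
import numpy as np

def compress(x_train, teacher, application, fit_student):
    """Three-step compression: teacher -> application -> supervised student."""
    z = teacher(x_train)              # step 1: low-dimensional representation
    y = application(z)                # step 2: application output on the training data
    return fit_student(x_train, y)    # step 3: student maps the raw input to y

# Hypothetical stand-ins, chosen only to make the sketch runnable.
def toy_teacher(x):
    # Stand-in for MBN: project centered data onto its top two principal directions.
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:2].T

def identity_application(z):
    # Visualization case: the representation itself is the training target.
    return z

def fit_linear_student(x, y):
    # Stand-in for the DNN: least-squares affine map from x to y.
    xa = np.hstack([x, np.ones((x.shape[0], 1))])
    w, *_ = np.linalg.lstsq(xa, y, rcond=None)
    return lambda x_new: np.hstack([x_new, np.ones((x_new.shape[0], 1))]) @ w

rng = np.random.default_rng(0)
x_train = rng.random((100, 6))
student = compress(x_train, toy_teacher, identity_application, fit_linear_student)
pred = student(x_train)               # prediction needs only the student
print(pred.shape)  # (100, 2)
```

Because the teacher and the application are only called on the training set, the student
never needs access to their internals, which is the black-box property the framework
relies on.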

The algorithm is a very basic framework. We can easily extend compressive MBN to
other techniques by replacing MBN in the first step with another unsupervised bootstrap
technique, potentially for better performance. We can also replace DNN in the third step
with many other supervised learners, but to our knowledge, DNN is already a good choice.

We may also design many new algorithms simply by specifying the second step for
different applications. Two examples are as follows. (i) When compressive MBN is used
for visualization, we may omit the second step and simply take the input and output of
MBN as the input and output of the DNN, respectively. (ii) When compressive MBN is used
for unsupervised prediction, we may run a hard clustering algorithm on the low-dimensional
representation of the training set, and get the predicted indicator vector of each
training data point. For example, if a data point is assigned to the second cluster, then
its predicted indicator vector is [0, 1, 0, 0, ..., 0]. We may also use the probabilistic
output of the clustering.
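
As a small illustration of example (ii), the snippet below converts hard cluster
assignments into indicator vectors. The function name indicator_vectors and the 0-based
cluster indexing are assumptions made for this sketch.

```python
def indicator_vectors(assignments, n_clusters):
    """Turn hard cluster assignments (0-based) into one-hot indicator vectors."""
    vectors = []
    for c in assignments:
        row = [0] * n_clusters
        row[c] = 1
        vectors.append(row)
    return vectors

# A point assigned to the second cluster (index 1) out of five clusters.
print(indicator_vectors([1, 4, 0], n_clusters=5)[0])  # [0, 1, 0, 0, 0]
```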

3. Experiments 

We conducted an initial experiment on MNIST. We showed that the technique is very helpful 
for reducing the high computational cost of MBN on unsupervised prediction problems. The 
MNIST data was normalized by dividing its entries by 255. 
MBN on a small subset of MNIST
Figure 1: MBN. Its prediction time on the 5000 images is 2333.24 seconds.


Compressive MBN on a small subset of MNIST (plot legend: digit classes 0 to 9)
Figure 2: Compressive MBN. Its prediction time on the 5000 images is 0.58 seconds.


3.1. Experiment on visualizing small subsets of MNIST 

In this subsection, we did not consider the generalization ability of MBN and compressive
MBN. Instead, we studied their visualization ability. A data set of 5000 unlabeled images
randomly selected from the training set of MNIST was used for both training and testing.

For the MBN training, we trained MBN similarly to the second experiment in (Zhang
(2014)). Specifically, the number of clusterings in each layer was set to 400. The parameter
k from layer 1 to layer 9 was set to 4000-2000-1000-500-250-125-65-30-15, respectively. As
we can see, MBN is a very large sparse model: (4000-2000-1000-500-250-125-65-30-15) × 400.
The parameter a for random feature selection was set to 0.5. The parameter r for random
reconstruction was set to 0.5. After getting the high-dimensional sparse representation from
MBN at the top hidden layer, we mapped it to two-dimensional space by
expectation-maximization principal component analysis (EM-PCA) (Roweis (1998)).
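
The size of this model can be checked with a few lines of arithmetic. The total number of
centers follows directly from the settings above; the dimension of the top-layer
representation additionally assumes that each clustering emits a one-hot code of length k,
which is an assumption of this sketch rather than something stated here.

```python
ks = [4000, 2000, 1000, 500, 250, 125, 65, 30, 15]   # k for layers 1 to 9
n_clusterings = 400                                   # clusterings per layer

centers_per_layer = [k * n_clusterings for k in ks]
total_centers = sum(centers_per_layer)
# Assumes one-hot codes of length k per clustering at the top layer.
top_layer_dim = ks[-1] * n_clusterings

print(total_centers)  # 3194000
print(top_layer_dim)  # 6000
```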

For the training of compressive MBN, we omitted the second step, and used the
input and output of MBN as the input and output of a DNN model, respectively. The
parameter settings were as follows. We trained a 784-2048-2048-2 DNN, where 784 is the
number of pixels of a 28 × 28 MNIST image. The dropout rate was set to 0.2. The rectified
linear unit was used as the hidden unit, and the linear function was used as the output
unit. The number of training epochs was set to 120. The batch size was set to 32. The
learning rate was set to 0.001.
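
For orientation, the forward pass of a network with this shape can be sketched as below.
The first layer size is taken to be 784, the number of pixels in a 28 × 28 MNIST image.
The weight initialization, the placement of dropout on the hidden units, the inverted
dropout scaling, and the names init_mlp and forward are assumptions of this sketch; the
training loop and optimizer are omitted.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, dropout_rate=0.0, rng=None):
    """ReLU hidden layers with optional inverted dropout, linear output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
        if dropout_rate > 0.0 and rng is not None:
            # Training mode: drop hidden units and rescale the survivors.
            mask = rng.random(h.shape) >= dropout_rate
            h = h * mask / (1.0 - dropout_rate)
    w, b = params[-1]
    return h @ w + b

rng = np.random.default_rng(0)
sizes = [784, 2048, 2048, 2]                   # assumed 784-dimensional input
params = init_mlp(sizes, rng)
n_params = sum(w.size + b.size for w, b in params)
out = forward(params, rng.random((32, 784)))   # one batch of 32, inference mode
print(out.shape, n_params)  # (32, 2) 5808130
```

At prediction time the network costs three dense matrix products per batch, regardless of
how many clusterings the compressed model contained.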

The two-dimensional visualizations produced by MBN and compressive MBN are
shown in Fig. 1 and Fig. 2, respectively. From the two figures, we found that the
visualizations of MBN and compressive MBN were equally good. When we used the features
for clustering, the NMIs of both methods were around 81%. Notably, the prediction time of
compressive MBN on the 5000 images was only 0.58 seconds, around 4000 times faster than
the 2333.24 seconds of MBN.¹
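
For readers unfamiliar with the metric, normalized mutual information (NMI) between a
ground-truth labelling and a clustering can be computed as sketched below. Several
normalizations are in use; this sketch assumes normalization by the geometric mean of the
two entropies, which may differ from the variant used to produce the numbers reported
here.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    """Mutual information normalized by sqrt(H(truth) * H(pred)) (assumed variant)."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum((c / n) * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    denom = sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0

# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
print(round(nmi([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]), 6))  # 1.0
```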

3.2. Experiment on the full MNIST 

We used all 60,000 training images for unsupervised model training and the 10,000 test
images for testing. We evaluated the unsupervised generalization ability of compressive
MBN on the test images.

3.2.1. Compressive MBN without random reconstruction (i.e. parameter r = 0)

For the MBN training, we trained MBN similarly to the third experiment in (Zhang
(2014)). Specifically, the number of clusterings in each layer was set to 400. The parameter
k from layer 1 to layer 9 was set to 4000-2000-1000-500-250-125-65-30-15, respectively. As
we can see, MBN is a very large sparse model: (4000-2000-1000-500-250-125-65-30-15) × 400.
The parameter a for random feature selection was set to 0.5. The parameter r for random
reconstruction was set to 0. After getting the high-dimensional sparse representation from
MBN at the top hidden layer, we mapped it to 5-dimensional space by EM-PCA (Roweis
(1998)). We further encoded the 5-dimensional representations into 10-dimensional indicator
vectors by k-means clustering, which is a specialization of the second step of compressive
MBN.

For the training of compressive MBN, we took the raw features of the training set as
the input of the DNN, and took the 10-dimensional predicted indicator vectors as the
training target of the DNN. The parameter settings of the DNN were as follows. We trained
a 784-2048-2048-10 DNN, where 784 is the number of pixels of a 28 × 28 MNIST image. The
dropout rate was set to 0.2. The rectified linear unit was used as the hidden unit, and
the sigmoid function was used as the output unit. The number of training epochs was set
to 50. The batch size was set to 128. The learning rate was set to 0.001.

Because k-means clustering suffers from local minima, we ran the aforementioned methods
10 times and recorded the average results. The experimental comparison between MBN
and compressive MBN is summarized in Fig. 3. From the figure, we observed that the
training accuracy curves of MBN and compressive MBN coincided completely; moreover, the
prediction curve of compressive MBN was even slightly better than that of MBN, and the
highest prediction accuracy of compressive MBN reached 84% in terms of NMI. The main
advantage of compressive MBN is that it needed only 1.15 seconds to predict 10,000
images, while MBN needed 4699.69 seconds. The prediction was thus accelerated by around
4000 times.

1. MBN did not enable parallel computing.

MBN & Compressive MBN without random reconstruction
Figure 3: Comparison of the generalization ability of MBN and compressive MBN on
clustering when the random reconstruction of MBN is not used (i.e. r = 0). The
clustering accuracy is evaluated by normalized mutual information. The prediction
time of MBN on the 10,000 test images is 4699.69 seconds. The prediction
time of compressive MBN on the 10,000 test images is 1.15 seconds.

MBN & Compressive MBN with random reconstruction
Figure 4: Comparison of the generalization ability of MBN and compressive MBN on
clustering when the random reconstruction of MBN is used (i.e. r = 0.5). The
prediction time of MBN on the 10,000 test images is 4857.45 seconds. The
prediction time of compressive MBN on the 10,000 test images is 1.10 seconds.

3.2.2. Compressive MBN with random reconstruction (i.e. parameter r = 0.5) 

It is shown in (Zhang (2014)) that when the data is small-scale (i.e. the training size
is similar to the largest parameter k), the random reconstruction operation can be quite
helpful. However, it is still unclear whether random reconstruction helps when
the data is large-scale (i.e. the training size is much larger than the largest parameter
k), since the largest k in the third experiment of (Zhang (2014)) was only 1000 and the
results of MBN in that experiment, with or without random reconstruction, were
unremarkable. In this subsection, we enlarged k to 4000 as in Section 3.2.1.

The experimental settings of both MBN and compressive MBN were the same as in
Section 3.2.1 except that we set r = 0.5 and mapped the sparse features to 2-dimensional
space by EM-PCA. The experimental results are summarized in Fig. 4. From the figure,
we observed that all experimental conclusions of Section 3.2.1 also held here, except
that when random reconstruction was used, the performance of both MBN and
compressive MBN was not as good as that without random reconstruction.
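
The speedups implied by the prediction times reported in the captions of Figs. 1 to 4 can
be recomputed as follows. The snippet only redoes the arithmetic on the reported timings;
the dictionary keys are labels chosen for this sketch.

```python
# (MBN seconds, compressive MBN seconds), copied from the captions of Figs. 1 to 4.
timings = {
    "5000 images, r = 0.5": (2333.24, 0.58),
    "10,000 test images, r = 0": (4699.69, 1.15),
    "10,000 test images, r = 0.5": (4857.45, 1.10),
}

speedups = {name: slow / fast for name, (slow, fast) in timings.items()}
for name, s in speedups.items():
    print(f"{name}: about {s:.0f}x faster")
```

All three ratios fall between roughly 4000 and 4400, consistent with the "around 4000
times" figure quoted in the text.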

4. Conclusions 

In this paper, we proposed a general framework for unsupervised model compression. The
framework takes MBN (Zhang (2014)) as a case study. The specialized technique, named
compressive MBN, uses a DNN as an auxiliary model for modeling the mapping function
from the input of MBN to the prediction result of a given application that takes the
low-dimensional output of MBN as its input. The new technique aims to solve the problem
that although MBN is simple, effective, robust, and efficient at the training stage, it is
time-consuming at prediction.

Our initial experimental results on MNIST showed that compressive MBN not only
inherited the generalization ability of MBN (and was even slightly better than MBN), but
also accelerated the prediction of MBN by around 4000 times.

Compressive MBN combines the effectiveness of MBN on unsupervised learning with
the effectiveness and efficiency of DNN on supervised learning, which makes it both
effective and efficient on unsupervised learning. Moreover, we can easily extend
compressive MBN to other unsupervised model compression techniques.